Abstract

We present the dialogue module of the
speech-to-speech translation system Verbmobil.
We follow the approach that the solution to
dialogue processing in a mediating scenario
cannot depend on a single constrained processing
tool, but must rest on a combination of several
simple, efficient, and robust components. We show
how our solution to dialogue processing works when
applied to real data, and give some examples where
our module contributes to the correct translation
from German to English.

1 Introduction 

The implemented research prototype of the
speech-to-speech translation system Verbmobil
(Wahlster, 1993; Bub and Schwinn, 1996) consists of
more than 40 modules for both speech and linguistic
processing.
The central storage for dialogue information within 
the overall system is the dialogue module that ex- 
changes data with 15 of the other modules. 

Basic notions within Verbmobil are turns and 
utterances. A turn is defined as one contribution of 
a dialogue participant. Each turn divides into utter- 
ances that sometimes resemble clauses as defined in 
a traditional grammar. However, since we deal ex- 
clusively with spoken, unconstrained contributions, 
utterances are sometimes just pieces of linguistic ma- 
terial. 

For the dialogue module, the most important
dialogue related information extracted for each
utterance is the so-called dialogue act (Jekat et
al., 1995).
Some dialogue acts describe solely the illocutionary 
force, while other more domain specific ones describe 
additionally aspects of the propositional content of 
an utterance. 

Prior to the selection of the dialogue acts, we
analyzed dialogues from Verbmobil's corpus of spoken
and transliterated scheduling dialogues. More than 
500 of them have been annotated with dialogue re- 
lated information and serve as the empirical founda- 
tion of our work. 

Throughout this paper we will refer to the example
dialogue partly shown in figure 1. The translations
are given as the deep processing line of Verbmobil
provides them. We also annotated the utterances
with the dialogue acts as determined by the semantic
evaluation module. "//" shows where utterance
boundaries were determined.

We start with a brief introduction to dialogue
processing in the Verbmobil setting. Section 3
introduces the basic data structures, followed by two
sections describing some of the tasks which are
carried out within the dialogue module. Before the
concluding remarks in section 8, we discuss aspects
of robustness and compare our approach to other
systems.

2 Introduction to Dialogue 
Processing in Verbmobil 

In contrast to many other NL-systems, the
Verbmobil system mediates a dialogue between two
persons. No restrictions are put on the locutors,
except that they must stick to the approx. 2500
words Verbmobil recognizes. Therefore, Verbmobil,
and especially its dialogue component, has to be able
to follow the dialogue in any direction. In addition,
the dialogue module is faced with incomplete and
incorrect input, and sometimes even gaps.

When designing a component for such a scenario, 
we have chosen not to use one big constrained pro- 
cessing tool. Instead, we have selected a combina- 
tion of several simple and efficient approaches, which 
together form a robust and efficient processing plat- 
form. 

As an effect of the mediating scenario, our module
cannot serve as a "dialogue controller" as in
man-machine dialogues. The only exception is when
clarification dialogues are necessary between
Verbmobil and a user.

A01: Tag // Herr Scheytt.
(greet, introduce_name)
(Hello, Mr Scheytt)

B02: Guten Tag // Frau Klein // Wir müssen
noch einen Termin ausmachen // für die
Mitarbeiterbesprechung.
(greet, introduce_name, init_date,
suggest_support_date)
(Hello, Mrs. Klein, we should arrange an
appointment, for the team meeting)

A03: Ja, // ich würde Ihnen vorschlagen im
Januar, // zwischen dem fünfzehnten und
neunzehnten.
(uptake, suggest_support_date,
request_comment_date)
(Well, I would suggest in January, between the
fifteenth and the nineteenth)

B04: Oh // das ist ganz schlecht. //
zwischen dem elften und achtzehnten Januar
bin ich in Hamburg.
(uptake, reject_date, suggest_support_date)
(Oh, that is really inconvenient, I'm in Hamburg
between the eighteenth of January and the eleventh,)

A09: Doch ich habe Zeit von sechsten Februar
bis neunten Februar
(suggest_support_date)
(I have time afterall from the 6th of February to the
9th of February)

B10: Sehr gut // das paßt bei mir auch //
Dann machen wir's gleich aus // für
Donnerstag // den achten // Wie wäre es denn
um acht Uhr dreißig //
(feedback_acknowledgement, accept_date,
init_date, suggest_support_date,
suggest_support_date, suggest_support_date)
(Very good, that too suits me, we will arrange for it,
for thursday, the eighth, how about half past eighth)

A11: Am achten // ginge es bei mir leider
nur bis zehn Uhr // Bei mir geht es besser
nachmittags.
(suggest_support_date, suggest_support_date,
accept_date)
(on the eighth, Is it only unfortunately possible for
me until 10 o'clock, It suits me better in the
afternoon)

B12: gut // um wieviel Uhr sollen wir uns
dann treffen?
(feedback_acknowledgement,
suggest_support_date)
(good, when should we meet)

A13: ich würde ähm vierzehn Uhr
vorschlagen // geht es bei Ihnen.
(suggest_support_date, request_comment_date)
(I would suggest 2 o'clock, is that possible for you?)

B14: sehr gut // das paßt bei mir auch //
das können wir festhalten
(accept_date, accept_date, accept_date)
(very good, that suits me too, we can make a note of
that)

Figure 1: An example dialogue

Due to its role as information server in the overall 
Verbmobil system, we started early in the project 
to collect requirements from other components in 
the system. The result can be divided into three 
subtasks: 

• we allow for other components to store and re- 
trieve context information. 

• we draw inferences on the basis of our input. 

• we predict what is going to happen next. 

Moreover, within Verbmobil there are different 
processing tracks: parallel to the deep, linguistic 
based processing, different shallow processing mod- 
ules also enter information into, and retrieve it from, 
the dialogue module. The data from these parallel 
tracks must be consistently stored and made acces- 
sible in a uniform manner. 



Figure 2 shows a screen dump of the graphical
user interface of our component while processing the
example dialogue. In the upper left corner we see the
structures of the dialogue sequence memory, where
the middle row represents turns, and the left
and right rows represent utterances as segmented 
by different analysis components. The upper right 
part shows the intentional structure built by the plan 
recognizer. Our module contains two instances of a 
finite state automaton. The one in the lower left 
corner is used for performing clarification dialogues, 
and the other for visualization purposes (see section 
[?]). The thematic structure representing temporal 
expressions is displayed in the lower right corner. 

3 Maintaining Context 

As a basis for storing context information we
developed the dialogue sequence memory. It is a generic
structure which mirrors the sequential order of turns
and utterances. A wide range of operations has been
defined on this structure. For each turn, we store
e.g. the speaker identification, the language of the
contribution, the processing track finally selected
for translation, and the number of translated
utterances. For the utterances we store e.g. the dialogue
act, dialogue phase, and predictions. These data are
partly provided by other modules of Verbmobil and
partly computed within the dialogue module itself (see
below).

Figure 3: A part of the sequence memory

Figure 3 shows the dialogue sequence memory after
the processing of turn B02. For the deep analysis
side (to the right), the turn is segmented into
four utterances: Guten Tag // Frau Klein // Wir
müssen noch einen Termin ausmachen // für die
Mitarbeiterbesprechung, for which the semantic
evaluation component has assigned the dialogue acts
Greet, Introduce_Name, Init_Date, and
Suggest_Support_Date respectively. To the left we
see the results of one of the shallow analysis
components. It splits up the input into two utterances
Guten Tag Frau Klein // Wir müssen ... die
Mitarbeiterbesprechung and assigns the dialogue acts
Greet and Init_Date.
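
To make the shape of such a memory concrete, the following is a
minimal sketch in Python, not the Verbmobil implementation. The
class names, the field names, and the track labels "deep" and
"shallow" are assumptions chosen for illustration; only the stored
items (speaker, language, dialogue act, phase, predictions) and the
B02 segmentations are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Utterance:
    text: str
    dialogue_act: str
    phase: Optional[str] = None  # e.g. filled in later by a plan recognizer
    predictions: Dict[str, float] = field(default_factory=dict)


@dataclass
class Turn:
    turn_id: str
    speaker: str
    language: str
    # One segmentation per processing track (track names are assumptions).
    segmentations: Dict[str, List[Utterance]] = field(default_factory=dict)


class SequenceMemory:
    """Keeps turns in the order in which they were contributed."""

    def __init__(self) -> None:
        self.turns: List[Turn] = []

    def add_turn(self, turn: Turn) -> None:
        self.turns.append(turn)

    def dialogue_acts(self, turn_id: str, track: str) -> List[str]:
        for turn in self.turns:
            if turn.turn_id == turn_id:
                return [u.dialogue_act for u in turn.segmentations.get(track, [])]
        raise KeyError(turn_id)


memory = SequenceMemory()
memory.add_turn(Turn("B02", speaker="B", language="de", segmentations={
    "deep": [
        Utterance("Guten Tag", "GREET"),
        Utterance("Frau Klein", "INTRODUCE_NAME"),
        Utterance("Wir müssen noch einen Termin ausmachen", "INIT_DATE"),
        Utterance("für die Mitarbeiterbesprechung", "SUGGEST_SUPPORT_DATE"),
    ],
    "shallow": [
        Utterance("Guten Tag Frau Klein", "GREET"),
        Utterance("Wir müssen noch einen Termin ausmachen "
                  "für die Mitarbeiterbesprechung", "INIT_DATE"),
    ],
}))
```

Storing both segmentations under the same turn keeps the parallel
tracks accessible in a uniform manner, at the cost of leaving the
alignment between them implicit.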

The need for and use of this structure is high- 
lighted by the following example. In the domain of 
appointment scheduling the German phrase Geht es 
bei Ihnen? is ambiguous: bei Ihnen can either re- 
fer to a location, in which case the translation is 
Would it be okay at your place? or, to a certain 
time. In the latter case the correct translation is Is 
that possible for you?. A simple way of disambiguat- 
ing this is to look at the preceding dialogue act(s). 
In our example dialogue, turn A13, the utterance
ich würde ähm vierzehn Uhr vorschlagen (I would
hmm fourteen o'clock suggest) contains the proposal
of a time, which is characterized by the dialogue act
SUGGEST_SUPPORT_DATE. With this dialogue act in
the immediately preceding context the ambiguity is
resolved as referring to a time and the correct
translation is determined.
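
The lookup described above can be sketched as follows. This is an
illustration only: the function name, the set TIME_ACTS, and the
choice of the location reading as fallback are assumptions, not the
rule used by the transfer module.

```python
from typing import List

# Dialogue acts that put a time on the table (assumed set; the text
# only mentions SUGGEST_SUPPORT_DATE for this example).
TIME_ACTS = {"SUGGEST_SUPPORT_DATE"}


def translate_geht_es_bei_ihnen(preceding_acts: List[str]) -> str:
    """Choose a reading of 'Geht es bei Ihnen?' from the preceding acts."""
    if preceding_acts and preceding_acts[-1] in TIME_ACTS:
        return "Is that possible for you?"
    # Fallback to the location reading is an assumption of this sketch.
    return "Would it be okay at your place?"
```

For turn A13 the immediately preceding act is SUGGEST_SUPPORT_DATE,
so the sketch selects the temporal reading.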

In our domain, in addition to the dialogue act, the
most important propositional information is the
dates as proposed, rejected, and finally accepted by
the users of Verbmobil. While it is the task of the
semantic evaluation module to extract time
information from the actual utterances, the dialogue module
integrates this information into its thematic
memory. This includes resolving relative time
expressions, e.g. two weeks ago, into precise time
descriptions, like "23rd week of 1996". The information
about the dates is split in a specialization hierarchy.
Each date to be negotiated serves as a root, while
the nodes represent the information about years,
months, weeks, days, days of week, period of day
and finally time. Each node also contains
information about the attitude of the dialogue participants
concerning this particular item: proposed, rejected, or
accepted by one of the participants.
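
As an illustration of resolving a relative time expression, the
sketch below maps "N weeks ago" to a week of a year. The speaking
date used in the example and the use of ISO week numbering are
assumptions; the text does not state which week numbering the
thematic memory uses.

```python
from datetime import date, timedelta


def resolve_weeks_ago(speaking_date: date, weeks: int) -> str:
    """Resolve 'N weeks ago' into an (ISO) week-of-year description."""
    target = speaking_date - timedelta(weeks=weeks)
    iso_year, iso_week, _ = target.isocalendar()
    return f"week {iso_week} of {iso_year}"


# Hypothetical speaking date: Monday, 17 June 1996.
print(resolve_weeks_ago(date(1996, 6, 17), 2))  # week 23 of 1996
```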

Figure 4 shows parts of the thematic structure
after the processing of turn B10. The black boxes
stand for the date currently under consideration.
Thursday, 8., is the current date agreed upon. We
also see the previously proposed interval from 6.-9.
of the same month in the box above (FROM_TO(6,9)).

4 Inferences 

Besides the mere storage of dialogue related data, 
there are also inference mechanisms integrating the 
data in representations of different aspects of the 
dialogue. These data are again stored in the context 
memories shown above and are accessed by the other 
Verbmobil modules. 

Plan Based Inferences 

Inspecting our corpus, we can distinguish three 
phases in most of the dialogues. In the first, the 
opening phase, the locutors greet each other and the 
topic of the dialogue is introduced. The dialogue 
then proceeds into the negotiation phase, where the 
actual negotiation takes place. It concludes in the 
closing phase where the negotiated topic is confirmed 
and the locutors say goodbye. This phase informa- 
tion contributes to the correct transfer of an utter- 
ance. For example, the German utterance Guten 
Tag is translated to "Hello" in the greeting phase, 
and to "Good day" in the closing phase. 

The task of determining the phase of the dialogue
has been given to the plan recognizer (Alexandersson,
199E). It builds a tree-like structure which
we call the intentional structure. The current
version makes use of plan operators both hand coded
and automatically derived from the Verbmobil
corpus. The method used is transferred from the field of
grammar extraction (Stolcke, 1994). To contribute
to the robustness of the system, the processing of
the recognizer is divided into several processing
levels like the "turn level" and the "domain dependent
level". The concepts of turn levels and the automatic
acquisition of operators are described in
(Alexandersson, 1996).

In figure 5 we see the structure after processing
turns B02 and A03. The leaves of the tree are the
dialogue acts. The root node of the left subtree for
B02 is a GREET-INIT-... operator which belongs
to the greeting phase, while the partly visible one to
the right belongs to the negotiation phase.

In the example used in this paper we are processing
a "well formed" dialogue, so the turn structure
can be linked into a structure spanning the
whole dialogue. We also see in figure 3 how the
phase information has been written into the boxes
representing the utterances of turn B02 as segmented
by the deep analysis.

Figure 4: Day/Day-of-Week detail of the thematic structure

Thematic Inferences 

In scheduling dialogues, referring expressions like
the German word nächste occur frequently. Depending
on the thematic structure it can be translated as
next if the date referred to is immediately after the 
speaking time, or following in the other cases. The 
thematic structure is mainly used to resolve this type 
of anaphoric expressions if requested by the semantic 
evaluation or the transfer module. The information 
about the relation between the date under consid- 
eration and the speaking time can be immediately 
computed from the thematic structure. 
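
A possible reading of this rule is sketched below for week-level
expressions. Treating "immediately after the speaking time" as "in
the calendar week right after the speaking week" is an assumption
of this sketch, as are the function names.

```python
from datetime import date


def weeks_between(earlier: date, later: date) -> int:
    """Number of calendar weeks (Monday-based) from earlier to later."""
    monday_earlier = earlier.toordinal() - earlier.weekday()
    monday_later = later.toordinal() - later.weekday()
    return (monday_later - monday_earlier) // 7


def translate_naechste(speaking_time: date, referred_date: date) -> str:
    """Pick 'next' if the referred week directly follows the speaking week."""
    if weeks_between(speaking_time, referred_date) == 1:
        return "next"
    return "following"
```

With a hypothetical speaking time of 17 June 1996, a date in the
week of 24 June yields "next", while a date in the week of 1 July
yields "following".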

The thematic structure is also used to check
whether the time expressions are correctly
recognized. If some implausible dates are recognized, e.g.
April, 31., a clarification can be invoked. The
system proposes a more plausible date to the speaker, and
waits for an acceptance or rejection of the proposal.
In the former case, the correct date will be translated;
in the latter, the user is asked to repeat the whole
turn.
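
The plausibility check can be sketched with the standard calendar
module. Proposing the nearest valid day of the same month is an
assumption made for this example; the text does not say how the
more plausible date is chosen.

```python
import calendar
from typing import Optional, Tuple


def check_date(year: int, month: int, day: int) -> Optional[Tuple[int, int]]:
    """Return None for a plausible date, else a proposed (month, day)."""
    last_day = calendar.monthrange(year, month)[1]
    if 1 <= day <= last_day:
        return None
    # Assumed repair strategy: clamp to the nearest valid day of the month.
    return (month, min(max(day, 1), last_day))


print(check_date(1996, 4, 31))  # (4, 30)
```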

Using the current state of the thematic structure
and the dialogue act in combination with the time
information of an utterance, multiple readings can
be inferred (Maier, 1996). For example, if both
locutors propose different dates, an implicit rejection
of the former date can be assumed.

5 Predictions 

A different type of inference is used to generate
predictions about what comes next. While the
plan-based component uses declarative knowledge, albeit
acquired automatically, dialogue act predictions are
based solely on the annotated Verbmobil corpus.
The computation uses the conditional frequencies of
dialogue act sequences to compute probabilities of
the most likely follow-up dialogue acts (Reithinger et
al., 1996), a method adapted from language
modeling (Jelinek, 199C). As described above, the dialogue
sequence memory serves as the central repository for
this information.
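
The idea of predicting follow-up acts from conditional frequencies
can be sketched as follows. This is a simplified illustration that
interpolates bigram and unigram relative frequencies; the class
name, the interpolation weight, and the toy training sequences are
assumptions, and speaker direction is not modeled here.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class DialogueActPredictor:
    def __init__(self, bigram_weight: float = 0.7) -> None:
        self.bigram_weight = bigram_weight  # assumed interpolation weight
        self.unigrams: Counter = Counter()
        self.bigrams: Dict[str, Counter] = defaultdict(Counter)

    def train(self, dialogues: List[List[str]]) -> None:
        for acts in dialogues:
            self.unigrams.update(acts)
            for prev, nxt in zip(acts, acts[1:]):
                self.bigrams[prev][nxt] += 1

    def predict(self, prev: str, n: int = 3) -> List[Tuple[str, float]]:
        total = sum(self.unigrams.values())
        followers = self.bigrams.get(prev, Counter())
        seen = sum(followers.values())
        # Without any evidence for this context, fall back to unigrams.
        lam = self.bigram_weight if seen else 0.0
        scores = {}
        for act, count in self.unigrams.items():
            p_uni = count / total
            p_bi = followers[act] / seen if seen else 0.0
            scores[act] = lam * p_bi + (1.0 - lam) * p_uni
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Toy act sequences, invented for illustration only.
predictor = DialogueActPredictor()
predictor.train([
    ["GREET", "INTRODUCE_NAME", "INIT_DATE", "SUGGEST_SUPPORT_DATE", "ACCEPT_DATE"],
    ["GREET", "INIT_DATE", "SUGGEST_SUPPORT_DATE", "REJECT_DATE",
     "SUGGEST_SUPPORT_DATE", "ACCEPT_DATE"],
])
```

Because of the unigram term, a context that never occurred in
training still receives a ranked list of candidate acts.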
Figure 5: Intentional structure for two turns

The sequence memory in figure 3 shows, in addition
to the actually recognized dialogue act, also the
predictions for the following utterance. In
(Reithinger et al., 1996) it is demonstrated that
exploiting the speaker direction significantly enhances
the prediction reliability. Therefore, predictions are
computed for both speakers. The numbers after the
predicted dialogue acts show the prediction
probabilities times 1000.

As can be seen in the figure, the actually recog- 
nized dialogue acts are, for this turn, among the two 
most probable predicted acts. Overall, approx. 74% 
of all recognized dialogue acts are within the first 
three predicted ones. 

Major consumers of the predictions are the semantic
evaluation module and the shallow translation
module. The former, which mainly uses knowledge
based methods to determine the dialogue act of
an utterance, exploits the predictions to narrow down
the number of possible acts to consider. The shallow
translation module integrates the predictions within 
a Bayesian classifier to compute dialogue acts di- 
rectly from the word string. 
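
One way to combine predictions with word evidence is to use the
predicted probabilities as the prior of a Bayesian classifier. The
sketch below uses a naive Bayes model with add-one smoothing; this
particular model, the class name, and the toy training examples are
assumptions and are not taken from the shallow translation module.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Set, Tuple


class ShallowActClassifier:
    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocabulary: Set[str] = set()

    def train(self, examples: List[Tuple[List[str], str]]) -> None:
        for words, act in examples:
            self.word_counts[act].update(words)
            self.vocabulary.update(words)

    def classify(self, words: List[str], prior: Dict[str, float]) -> Optional[str]:
        best_act, best_score = None, -math.inf
        for act, counts in self.word_counts.items():
            p_act = prior.get(act, 0.0)
            if p_act <= 0.0:
                continue  # acts ruled out by the predictions are skipped
            total = sum(counts.values()) + len(self.vocabulary)
            score = math.log(p_act)
            for word in words:
                score += math.log((counts[word] + 1) / total)  # add-one smoothing
            if score > best_score:
                best_act, best_score = act, score
        return best_act


# Toy training data, invented for illustration only.
classifier = ShallowActClassifier()
classifier.train([
    (["guten", "tag"], "GREET"),
    (["wie", "wäre", "es", "um", "acht"], "SUGGEST_SUPPORT_DATE"),
    (["das", "paßt", "bei", "mir"], "ACCEPT_DATE"),
])
prior = {"GREET": 0.1, "SUGGEST_SUPPORT_DATE": 0.4, "ACCEPT_DATE": 0.5}
```

Here the prior plays the role of the predictions: an act with
prior probability zero is never returned, whatever the words are.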

6 Robustness 

For the dialogue module there are two major sources
of uncertainty during operation. On the one hand,
the user's dialogue behaviour cannot be controlled.
On the other hand, the segmentation as computed
by the syntactic-semantic construction module, and
the dialogue acts as computed by the semantic
evaluation module, are very often not the ones a linguistic
analysis on paper would produce. Our example
dialogue is a very good example of the latter problem.

Since no module in Verbmobil must ever crash,
we had to apply various methods to get a high degree
of robustness. The most knowledge intensive module
is the plan recognizer. The robustness of this
subcomponent is ensured by dividing the construction of
the intentional structure into several processing
levels. Additionally, at the turn level the operators are
learned from the annotated corpus. If the
construction of parts of the structure fails, some functionality
has been developed to recover. An important
ingredient of the processing is the notion of repair: if
the plan construction is faced with something
unexpected, it uses a set of specialized repair operators to
recover. If parts of the structure could not be built,
we can estimate on the basis of predictions what the
gap consisted of.

The statistical knowledge base for the prediction
algorithm is trained on the Verbmobil corpus, which
for the most part contains well-behaved dialogues.
Although prediction quality gets worse if a sequence
of dialogue acts has never been seen, the
interpolation approach to compute the predictions still
delivers useful data.

As mentioned above, to contribute to the
correctness of the overall system we perform different kinds
of clarification dialogues with the user. In addition
to inconsistent dates, we also recognize, for example,
similar words in the input that are likely to be
confused by the speech recognizer. Examples are
the German words for thirteenth (dreizehnter) and
thirtieth (dreißigster). We resolve these problems
within a uniform computer-human interaction.

7 Related Work 



In the speech-to-speech translation system Janus
(Lavie et al., 1996), two different approaches, a plan
based and an automaton based, to model dialogues
have been implemented. Currently, only one is used
at a time. For Verbmobil, (Alexandersson and
Reithinger, 1995) showed that the descriptive power
of the plan recognizer and the predictive power of
the statistical component make the automaton
obsolete.

The automatic acquisition of a dialogue model
from a corpus is reported in (Kita et al., 1996).
They extract a probabilistic automaton using an
annotated corpus of up to 60 dialogues. The transitions
correspond to dialogue acts. This method captures
only local discourse structures, whereas the plan
based approach of Verbmobil also allows for the
description of global structures. Comparable
structures are also defined in the dialogue processing of
Trains (Traum and Allen, 1992). However, they
are defined manually and have not been tested on
larger data sets.

8 Conclusion and Future Work 

Dialogue processing in a speech-to-speech
translation system like Verbmobil requires innovative and
robust methods. In this paper we presented different
aspects of the dialogue module while processing
one example dialogue. The combination of knowledge
based and statistical methods resulted in a reliable
system. Using the Verbmobil corpus as empirical
basis for training and test purposes significantly
improved the functionality and robustness of our
module, and allowed us to focus our efforts on real
problems. The system is fully integrated in the
Verbmobil system and has been tested on several thousand
utterances.

Nevertheless, processing in the real system still
creates new challenges. One problem that has to
be tackled in the future is the segmentation of turns
into utterances. Currently, turns are very often split
up into too many and too small utterances. In the
future, we will have to focus on the problem of
"glueing" fragments together. When the glued
utterances are given back to the transfer and generation
modules, this will enhance translation quality.

Future work also includes more training and the
ability to handle sparse data. Although we use one of
the largest annotated corpora available, for purposes
like training we still need more data.